Abstract

This paper considers a cross-layer adaptive modulation system that is modeled as a Markov decision
process (MDP). We study how to utilize the monotonicity of the optimal transmission policy to relieve
the computational complexity of dynamic programming (DP). In this system, a scheduler controls the
bit rate of the m-quadrature amplitude modulation (m-QAM) in order to minimize the long-term losses
incurred by the queue overflow in the data link layer and the transmission power consumption in the
physical layer. The work is done in two steps. First, we observe the L♮-convexity and submodularity
of DP to prove that the optimal policy is always nondecreasing in queue occupancy/state and derive the
sufficient condition for it to be nondecreasing in both queue and channel states. We also show that, due
to the L♮-convexity of DP, the variation of the optimal policy in queue state is restricted by a bounded
marginal effect: the increment of the optimal policy between adjacent queue states is no greater than one.
Second, we use the monotonicity results to present two low-complexity algorithms: monotonic policy
iteration (MPI) based on L♮-convexity and discrete simultaneous perturbation stochastic approximation
(DSPSA). We run experiments to show that the time complexity of MPI based on L♮-convexity is much
lower than that of DP and of the conventional MPI that is based on submodularity, and that DSPSA is
able to adaptively track the optimal policy when the system parameters change.
Fig. 1. Cross-layer adaptive m-QAM system. $f^{(t)}$ denotes the number of packets arrived at data link
layer at time $t$. The packet arrival process is random. The scheduler controls the number of bits in the
QAM symbol in order to minimize the queue overflow and transmission power consumption
simultaneously and in the long run.

Index Terms

cross-layer adaptive modulation, dynamic programming, L♮-convexity, Markov decision process,
stochastic approximation, submodularity.


I. Introduction 

Fig. 1 shows a cross-layer adaptive m-quadrature amplitude modulation (m-QAM) system. It is
assumed that packets from higher layers (e.g., application layer) arrive at the data link layer randomly.
They are buffered by a first-in-first-out (FIFO) queue in the data link layer before the transmission. The
physical layer adopts an m-QAM scheme, where m, the constellation size, is controlled by a scheduler. In
this system, m determines not only the transmission rate in the physical layer but also the departure rate
of the queue in the data link layer. The objective of the scheduler is to minimize the queue overflow and
transmission power consumption simultaneously by considering the queue occupancy/state and channel
condition/state and their expectations in the long run. The optimization problem in Fig. 1 is a cross-layer
one: it incorporates the idea of adaptive modulation in the physical layer [1], [2] and the quality of
service (QoS) concern associated with queueing effects in the data link layer.

There are many research works concerning the cross-layer adaptive m-QAM system in Fig. 1, e.g.,
[3]-[9]. In these works, by adopting a finite-state Markov chain (FSMC) model of the wireless channel(s)
[10], a Markov decision process (MDP) model is proposed to formulate the dynamics (e.g., the statistics
of the queue occupancy based on packet arrival probability and the variation of the channel state in FSMC)
in the cross-layer adaptive m-QAM system, and the optimal policy that minimizes the long-term losses
incurred in both data link and physical layers is searched by a dynamic programming (DP) algorithm,
e.g., value or policy iteration. The simulation results in these works show that scheduling across layers,
instead of within only one layer, by considering the stochastic features of the system can provide good
QoS and/or throughput in both data link and physical layers in the long run.
However, most of these studies focus on proposing system models and formulating problems without
considering the computational complexity involved in solving the long-term optimization problems. DP
is a well-known method to solve MDP-modeled optimization problems [11]. However, the crucial
limitation of DP is that its computation load grows drastically with the cardinalities of the state sets
in the MDP. This problem is called the curse of dimensionality [12] and makes DP inefficient for solving
high-dimensional MDP problems. Take the system in Fig. 1 for example. If the number of channel
states in FSMC increases, the time complexity in each iteration of DP may grow quadratically; if the
system is extended to a multi-user one with MIMO (multiple-input and multiple-output) channel, the time
complexity of DP may grow exponentially with both the number of users and the number of channels. In
addition, DP is not suitable for real-time transmission scheduling either. In practical applications,
we wish to design a model-free reinforcement learning algorithm that is able to quickly converge to the
optimal policy and adaptively track the optimum when the system parameters change. But DP is an
off-line algorithm, i.e., running DP requires the full knowledge of the MDP, and it is hard for DP to
converge in real time for a large-scale MDP system when computational resources are limited. Therefore,
it is worth discussing how to relieve the computational complexity of DP for the cross-layer adaptive
modulation system in Fig. 1.

On the other hand, the studies in [13]-[16] show that it is possible to propose low-complexity and
model-free algorithms for the cross-layer optimization problem if the optimal policy is monotonic. In
[13], [14], [16], a cross-layer adaptive modulation system with MIMO (multiple-input and
multiple-output) channels is studied. The authors prove that the optimal transmission policy is
nondecreasing in queue state/occupancy if the DP is submodular. In [13], a modified policy iteration
(MPI) algorithm is proposed based on the submodularity. It is shown that the MPI algorithm searches the
optimal policy with lower complexity than DP. In [15], a multi-user adaptive m-QAM system is modeled
by a congestion game, where the optimal randomized policy, a randomized mixture of deterministic
policies, is also nondecreasing in queue state due to the submodularity. The authors propose a
simultaneous perturbation stochastic approximation (SPSA) algorithm for the decision maker to learn the
optimal randomized policy in real time.

The main purpose of this paper is also to study how to utilize the monotonicity of the optimal
transmission policy to relieve the computational complexity of DP in the cross-layer adaptive m-QAM
system in Fig. 1. The study is based on the MDP formulation of the m-QAM adaptive modulation system
proposed in [3], [5]. Our work differs from the ones in [13]-[16] in three aspects. First, we establish the
sufficient condition for the existence of a monotonic optimal transmission policy in not only the queue
state but also the channel state in the MDP. Second, we show that the monotonicity of the optimal
policy in the queue state is due to the L♮-convexity, a stricter property than submodularity, such that
the variation of the resulting optimal policy is not only nondecreasing but also restricted by a bounded
marginal effect. We propose an MPI algorithm based on L♮-convexity and show by experiment that
its complexity is much lower than that of the MPI algorithm based on submodularity proposed in [13],
[14]. Third, the optimal policy is deterministic instead of randomized as in [15]. For the purpose of
learning this optimal deterministic policy in real time, we propose to use a discrete simultaneous
perturbation stochastic approximation (DSPSA) algorithm based on the gradient calculation method for
L♮-convexity in [17].


A. Main Results 

The main results in this paper are listed as follows. 

• We prove that the optimal transmission policy is always nondecreasing in queue state due to the
L♮-convexity of DP. It is also shown that the variation of the optimal policy in queue state is restricted
by a bounded marginal effect: the increment of the optimal policy between adjacent queue states
is no greater than one, i.e., if the optimal modulation scheme is m-QAM for a certain queue state,
then the optimal modulation scheme for its adjacent queue states must be m-QAM, (m + 1)-QAM
or (m − 1)-QAM.

• By observing the submodularity of DP, we derive the sufficient conditions for the optimal policy to
be nondecreasing in both queue and channel states. We show that these conditions are satisfied if
the channel experiences slow¹ and flat fading and a proper value of the weight factor (a coefficient
in the immediate cost function) is chosen.

• We present an MPI algorithm for searching the monotonic optimal policy based on the L♮-convexity
of DP. It is shown that the time complexity of MPI based on L♮-convexity is much lower than that of
DP and of the MPI based on submodularity in [13], [14].

• We prove that the optimal transmission policy can be determined by a set of monotonic queue
thresholds. For this reason, the optimal policy can be searched by solving a constrained minimization
problem over queue thresholds. For solving this problem, we propose to use the DSPSA algorithm, a
simulation-based line search method using an augmented Lagrangian penalty method, to approximate
the minimizer (the optimal queue thresholds). We run experiments to show the convergence
performance of DSPSA. We show that DSPSA is able to adaptively track the optimum and optimizer
when the system parameters change.

¹In this paper, we assume that the fading is slow with respect to the decision duration, i.e., the
normalized Doppler frequency shift, the product of the maximum Doppler shift and the decision
duration, is no greater than 0.01.

B. Paper Organization 

The rest of the paper is organized as follows. In Section II, we describe the assumptions and MDP
formulation, state the optimization objective and present the DP algorithm for the adaptive m-QAM
system in Fig. 1. In Section III, we study the existence of a monotonic optimal transmission policy in
queue and channel states by observing the L♮-convexity and submodularity of DP. In Section IV, we
present the MPI algorithm based on L♮-convexity and compare its time complexity with that of DP and
of the MPI based on submodularity. In Section V, we convert DP to a discrete multivariate minimization
problem with inequality constraints and show that the optimal policy can be approximated by a DSPSA
algorithm.

II. System and MDP Formulation 

Consider the cross-layer adaptive m-QAM system in Fig. 1. Messages from higher layers are
encapsulated in packets of equal length and stored in a FIFO queue in the data link layer. The output of
the queue is connected to an m-QAM transmitter in the physical layer, where the bit rate of the
modulation scheme is controlled by a scheduler. The packets from higher layers (e.g., application layer)
arrive at the queue in the data link layer randomly. The m-QAM transmitter sends packets through a
wireless fading channel to the receiver. The optimization problem of the scheduler is to minimize queue
overflow in the data link layer and transmission power consumption in the physical layer in the long run.

A. Assumptions 

Let the decision making process be discrete, i.e., the time is divided into small intervals called decision
epochs and denoted by $t$. Each decision epoch lasts for $T_D$ seconds. Let the decision making process
start from $t = 0$ and go on for an infinitely long time, i.e., $t \in \{0, 1, \dots, \infty\}$. In this
system, we assume the following.

Assumption 2.1: Let $L_P$ denote the length of a packet in bits. The number of storage units (in packets)
in the FIFO queue is $L_B < \infty$, i.e., the queue can store at most $L_B$ packets, or $L_B L_P$ bits.
The newly arrived packets are dropped if the queue is full. We call this packet loss due to queue
overflow.

Assumption 2.2: The packet arrival process $\{f^{(t)}\}$ is i.i.d. $f^{(t)} \in \{0, 1, \dots, L_B\}$
denotes the number of packets arrived at the queue at $t$.
Assumption 2.3: Let $a^{(t)} \in \mathcal{A} = \{0, 1, \dots, A_m\}$ denote the action taken by the
scheduler at $t$, where the maximum action $A_m < L_B$. Here, $a^{(t)} = 0$ denotes no transmission,
and the value of $a^{(t)}$ when $a^{(t)} \neq 0$ determines the number of bits in the QAM symbol that is
transmitted by the m-QAM transmitter at $t$, i.e., packet(s) are transmitted by $2^{a^{(t)}}$-QAM. If
$a^{(t)} \neq 0$, the number of symbols transmitted by the m-QAM transmitter in one decision epoch is
fixed to $L_P$. For example, if $a^{(t)} = 3$, 3 packets, or $3L_P$ bits, depart from the queue. Each 3
bits are modulated to one $2^3$-QAM symbol. In total, $L_P$ $2^3$-QAM symbols are transmitted through
the wireless channel. So, $a^{(t)}$ also denotes the number of packets departing from the queue at $t$.
Let $T_s$ denote the symbol duration in seconds. Then, one decision epoch lasts for $T_D = L_P T_s$
seconds.

Assumption 2.4: Let $\gamma^{(t)}$ denote the instantaneous signal-to-noise ratio (SNR) of the wireless
fading channel. $\{\gamma^{(t)}\}$ is a stationary random process that is independent of $\{f^{(t)}\}$.
Let the full SNR variation range of the wireless channel be partitioned into $K$ non-overlapping regions
$\{[\Gamma_1, \Gamma_2), [\Gamma_2, \Gamma_3), \dots, [\Gamma_K, \infty)\}$, where
$\Gamma_1 < \Gamma_2 < \dots < \Gamma_K$. Denote by $h^{(t)} \in \mathcal{H} = \{1, 2, \dots, K\}$ the
channel state at $t$. We say $h^{(t)} = k$ if $\gamma^{(t)} \in [\Gamma_k, \Gamma_{k+1})$. The channel is
modeled by an FSMC [10] according to the channel parameters, e.g., maximum Doppler shift, average SNR
and statistics. The channel dynamics is characterized by the channel state transition probability
$P_{h^{(t)}h^{(t+1)}} = \Pr(h^{(t+1)} \mid h^{(t)})$. The scheduler knows the value of $h^{(t)}$ to support
the decision at each decision epoch.²
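The SNR partition in Assumption 2.4 can be illustrated with a minimal sketch. The function name `channel_state` and the boundary values in `GAMMA` are hypothetical and chosen only for illustration; they are not taken from the paper.

```python
from bisect import bisect_right

def channel_state(snr, boundaries):
    """Map an instantaneous SNR to a channel state k in {1, ..., K}.

    boundaries = [Gamma_1, ..., Gamma_K] must be increasing; state k covers
    [Gamma_k, Gamma_{k+1}) and the last state covers [Gamma_K, infinity).
    """
    if snr < boundaries[0]:
        raise ValueError("SNR lies below the first boundary Gamma_1")
    # bisect_right counts how many boundaries are <= snr, which is exactly k.
    return bisect_right(boundaries, snr)

# Hypothetical boundaries on a linear SNR scale (K = 4), for illustration only.
GAMMA = [0.0, 0.5, 1.0, 2.0]
states = [channel_state(s, GAMMA) for s in (0.0, 0.49, 0.5, 1.7, 25.0)]
```

Because the regions are half-open on the right, an SNR equal to a boundary falls into the higher state.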

Assumption 2.5: The order of the events in each decision epoch is shown in Fig. 2. At the beginning
of the decision epoch $t$, the scheduler observes the system state $x^{(t)}$ and takes an action
$a^{(t)}$. A cost $c(x^{(t)}, a^{(t)})$ is immediately incurred after $a^{(t)}$. Then, $f^{(t)}$ packet(s)
arrive(s) at the queue. The definitions of $x^{(t)}$ and $c$ will be given in Section II-B.

²The value of the channel state $h^{(t)}$ can be obtained by using some channel estimation technique,
e.g., [18]. We assume that the channel state does not significantly change from one decision epoch to
another or when some pilot symbols are used to estimate the channel state. In this paper, we assume
perfect channel estimation and that the value of $h^{(t)}$ is known before the decision making,
determining the value of $a^{(t)}$ at each decision epoch $t$.

Fig. 2. Events happen in decision epoch $t$ in order: (1) system state $x^{(t)}$ is observed; (2) action
$a^{(t)}$ is taken; (3) immediate cost $c(x^{(t)}, a^{(t)})$ is incurred; (4) packet(s) arrive(s) at queue.


B. Markov Decision Process Modelling

Let $b^{(t)} \in \mathcal{B} = \{0, 1, \dots, L_B\}$ be the number of packets held in the queue at decision
epoch $t$. We call $b^{(t)}$ the queue state/occupancy. We define
$x^{(t)} = (b^{(t)}, h^{(t)}) \in \mathcal{X} = \mathcal{B} \times \mathcal{H}$ as the system state at $t$.
Based on Assumptions 2.2 and 2.5, the variation of the queue state is governed by the Lindley recursive
equation [19]

\[
b^{(t+1)} = \min\left\{ [b^{(t)} - a^{(t)}]^+ + f^{(t)},\; L_B \right\}, \tag{1}
\]


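where $[y]^+ = \max\{0, y\}$. Therefore, the queue transition probability can be worked out by the
statistics of $\{f^{(t)}\}$ as

\[
P^{a^{(t)}}_{b^{(t)}b^{(t+1)}} = \Pr(b^{(t+1)} \mid b^{(t)}, a^{(t)}) =
\begin{cases}
\Pr\left(f^{(t)} = b^{(t+1)} - [b^{(t)} - a^{(t)}]^+\right) & b^{(t+1)} < L_B \\
\Pr\left(f^{(t)} \geq L_B - [b^{(t)} - a^{(t)}]^+\right) & b^{(t+1)} = L_B
\end{cases}
\tag{2}
\]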
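As a rough illustration of (1) and (2), the sketch below builds one row of the queue transition matrix. It assumes Poisson arrivals, as used later in the examples of Section III-C; the function names `poisson_pmf` and `queue_transition_row` are hypothetical.

```python
import math

def poisson_pmf(k, lam):
    """Pr(f = k) for f ~ Poisson(lam)."""
    return math.exp(-lam) * lam**k / math.factorial(k)

def queue_transition_row(b, a, L_B, lam):
    """Pr(b' | b, a) for b' = 0..L_B, following the Lindley recursion (1)."""
    base = max(b - a, 0)                 # [b - a]^+: packets left after departure
    row = [0.0] * (L_B + 1)
    for b_next in range(base, L_B):      # b' < L_B: exactly b' - base arrivals
        row[b_next] = poisson_pmf(b_next - base, lam)
    row[L_B] = 1.0 - sum(row[:L_B])      # b' = L_B: at least L_B - base arrivals
    return row

# Assumed parameters: L_B = 15 and Pois(3) arrivals as in Section III-C.
row = queue_transition_row(b=3, a=1, L_B=15, lam=3.0)
```

States below $[b - a]^+$ get zero probability because arrivals cannot be negative, and all overflowing arrival counts are lumped into the full-queue state.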


Because of the independence of the packet arrival and channel fading processes assumed in
Assumption 2.4, the system state transition probability is given by

\[
P^{a^{(t)}}_{x^{(t)}x^{(t+1)}} = \Pr(x^{(t+1)} \mid x^{(t)}, a^{(t)})
= P^{a^{(t)}}_{b^{(t)}b^{(t+1)}} \, P_{h^{(t)}h^{(t+1)}}. \tag{3}
\]


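Define the immediate cost $c \colon \mathcal{X} \times \mathcal{A} \to \mathbb{R}_+$ as

\[
c(x^{(t)}, a^{(t)}) = c(b^{(t)}, h^{(t)}, a^{(t)}) = c_q(b^{(t)}, a^{(t)}) + c_{tr}(h^{(t)}, a^{(t)}), \tag{4}
\]

where $c_q$ and $c_{tr}$ quantify the costs associated with the queueing effect in the data link layer and
transmission power consumption in the physical layer, respectively. We define $c_q$ as

\[
c_q(b^{(t)}, a^{(t)}) = w \, \mathbb{E}_f\left[ \left[ [b^{(t)} - a^{(t)}]^+ + f^{(t)} - L_B \right]^+ \right], \tag{5}
\]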


where $w > 0$ is a weight factor. Here, $c_q$ is proportional to the expected number of lost packets due
to queue overflow. We define $c_{tr}$ as

\[
c_{tr}(h^{(t)}, a^{(t)}) = -\frac{\ln(5P_b)}{1.5\,\Gamma_{h^{(t)}}} \left( 2^{a^{(t)}} - 1 \right), \tag{6}
\]

where $P_b < 0.2$ is a bit error rate (BER) constraint. Here, $c_{tr}$ is an estimation of the minimum
power required to transmit $a^{(t)}$ bits/symbol in channel state $h^{(t)}$ that will result in an average
BER no greater than $P_b$. As explained in [5], the definition of $c_{tr}$ is based on a BER upper bound
for m-QAM transmission derived in [20].

Note that, by using $w$, the immediate cost $c$ in (4) is in fact a weighted sum of the losses incurred in
the data link and physical layers. The weight factor $w$ can be regarded as the priority of minimizing the
cost incurred in the data link layer as opposed to that in the physical layer.
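The queueing cost $c_q$ in (5) can be evaluated numerically as sketched below. This is only an illustration under assumed Poisson arrivals (taken from the examples in Section III-C), with the expectation truncated at a hypothetical cutoff `f_max`; the function name is also hypothetical. The transmission cost $c_{tr}$ is not covered here.

```python
import math

def expected_overflow_cost(b, a, L_B, w, lam, f_max=60):
    """c_q(b, a) = w * E_f[ ([b - a]^+ + f - L_B)^+ ] for f ~ Poisson(lam).

    The expectation is truncated at f_max arrivals; for lam = 3 the neglected
    tail mass is far below floating-point precision.
    """
    base = max(b - a, 0)
    total = 0.0
    for f in range(f_max + 1):
        pmf = math.exp(-lam) * lam**f / math.factorial(f)
        total += pmf * max(base + f - L_B, 0)   # packets lost for this arrival count
    return w * total

# A full queue with no transmission loses every arriving packet (3 on average);
# transmitting 5 packets first leaves room and lowers the expected loss.
cost_idle = expected_overflow_cost(b=15, a=0, L_B=15, w=1.0, lam=3.0)
cost_send = expected_overflow_cost(b=15, a=5, L_B=15, w=1.0, lam=3.0)
```

The weight `w` scales the cost linearly, which is how it trades queue overflow against transmission power in (4).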

C. Objective 

The optimization objective of the scheduler is to minimize the discounted sum of the immediate costs
over decision epochs, which can be mathematically described as

\[
\min \mathbb{E}\left[ \sum_{t=0}^{\infty} \beta^t c(x^{(t)}, a^{(t)}) \,\middle|\, x^{(0)} = x \right],
\quad \forall x \in \mathcal{X}, \tag{7}
\]

where $\beta \in [0, 1)$ is the discount factor and $x^{(t+1)} \sim \Pr(\cdot \mid x^{(t)}, a^{(t)})$.
$\beta$ describes how far-sighted a decision maker is: since $\beta$ assigns exponentially decaying weights
to the immediate costs in the future, the scheduler becomes more far-sighted as $\beta \to 1$. In addition,
$\beta < 1$ ensures that the limit of the infinite series is finite.


D. Dynamic Programming 


Based on Assumptions 2.2 and 2.4, the MDP model in Section II-B is stationary (time-invariant). It is
proved in [11] that there exists an optimal policy that is stationary and deterministic for all discounted
stationary MDPs with finite state and action spaces. Therefore, by defining the expected total discounted
cost under a stationary deterministic policy $\theta \colon \mathcal{X} \to \mathcal{A}$ as

\[
V_\theta(x) = \mathbb{E}\left[ \sum_{t=0}^{\infty} \beta^t c(x^{(t)}, \theta(x^{(t)})) \,\middle|\, x^{(0)} = x \right], \tag{8}
\]

problem (7) is equivalent to

\[
\min_\theta V_\theta(x), \quad \forall x \in \mathcal{X}. \tag{9}
\]

Since $V_\theta$ can be expressed by the Bellman equation [21]

\[
V_\theta(x) = c(x, \theta(x)) + \beta \sum_{x'} P^{\theta(x)}_{xx'} V_\theta(x'), \tag{10}
\]

problem (9) can be solved by DP [11]

\[
V^{(n)}(x) := \min_{a \in \mathcal{A}} Q(x, a), \quad \forall x \in \mathcal{X}, \tag{11}
\]

where

\[
Q(x, a) = c(x, a) + \beta \sum_{x'} P^a_{xx'} V^{(n-1)}(x'). \tag{12}
\]

The optimal policy $\theta^*$ is determined by

\[
\theta^*(x) = \arg\min_{a \in \mathcal{A}} \left\{ c(x, a) + \beta \sum_{x'} P^a_{xx'} V^{(N)}(x') \right\},
\quad \forall x \in \mathcal{X}, \tag{13}
\]

where $N$ is the iteration index when (11) converges.³

Note that, from (10) to (13), we drop the notation $t$ and use $x = (b, h)$ and $x' = (b', h')$ to denote
states in the current and next decision epochs, respectively, because the MDP under consideration is
stationary.
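The recursion (11)-(13) can be sketched as a generic value iteration. The code below is a minimal illustration on a hypothetical two-state, two-action toy MDP, not the m-QAM system itself; the function name `value_iteration`, the stopping threshold `eps` and the toy cost and transition functions are all assumptions.

```python
def value_iteration(states, actions, cost, trans, beta, eps=1e-8):
    """Iterate V(x) := min_a Q(x, a) until successive iterates differ by <= eps."""
    def q(x, a, V):
        # Q(x, a) = c(x, a) + beta * sum_x' P^a_{xx'} V(x')
        return cost(x, a) + beta * sum(p * V[y] for y, p in trans(x, a).items())

    V = {x: 0.0 for x in states}                    # V^(0) = 0
    while True:
        V_new = {x: min(q(x, a, V) for a in actions) for x in states}
        gap = max(abs(V_new[x] - V[x]) for x in states)
        V = V_new
        if gap <= eps:
            break
    # Greedy policy with respect to the converged value function, as in (13).
    policy = {x: min(actions, key=lambda a: q(x, a, V)) for x in states}
    return V, policy

# Hypothetical toy MDP: action 1 costs 0.5 and moves to state 0,
# action 0 is free but moves to the more expensive state 1.
toy_cost = lambda x, a: x + 0.5 * a
toy_trans = lambda x, a: {0: 1.0} if a == 1 else {1: 1.0}
V, policy = value_iteration([0, 1], [0, 1], toy_cost, toy_trans, beta=0.9)
```

Each sweep evaluates $Q$ for every state-action pair and sums over all next states, which is the source of the per-iteration cost discussed in Section IV.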


III. Monotonic Optimal Transmission Policy


This section examines the monotonicity of the optimal transmission policy in queue and channel states.
We first clarify some related definitions and theorems as follows.

Definition 3.1 (Submodularity [22], [23]): Let $e_i \in \mathbb{Z}^n$ be an n-tuple with all zero entries
except the $i$th entry being one. $f \colon \mathbb{Z}^n \to \mathbb{R}$ is submodular if
$f(x + e_i) + f(x + e_j) \geq f(x) + f(x + e_i + e_j)$ for all $x \in \mathbb{Z}^n$ and
$1 \leq i, j \leq n$ with $i \neq j$.

Definition 3.2 (L♮-convexity [24]): $f \colon \mathbb{Z} \to \mathbb{R}$ is L♮-convex in $x$ if
$f(x + 1) + f(x - 1) - 2f(x) \geq 0$ for all $x$; $f \colon \mathbb{Z}^n \to \mathbb{R}$ is L♮-convex in
$x$ if $\psi(x, \zeta) = f(x - \zeta \mathbf{1})$ is submodular in $(x, \zeta)$, where
$\mathbf{1} = (1, 1, \dots, 1) \in \mathbb{Z}^n$.
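Definitions 3.1 and 3.2 can be checked numerically on a finite grid, as in the sketch below. The function names, the grid `box` and the range of $\zeta$ are hypothetical; a check on a finite box can only refute the property on that box, it does not prove it on all of $\mathbb{Z}^n$.

```python
from itertools import product

def is_submodular(f, box, tol=1e-12):
    """Check f(x+e_i) + f(x+e_j) >= f(x) + f(x+e_i+e_j) for all i < j on a grid.

    box is a list of (lo, hi) pairs; x ranges over lo..hi-1 in each coordinate.
    """
    n = len(box)
    for x in product(*(range(lo, hi) for lo, hi in box)):
        for i in range(n):
            for j in range(i + 1, n):
                x_i = tuple(v + (k == i) for k, v in enumerate(x))
                x_j = tuple(v + (k == j) for k, v in enumerate(x))
                x_ij = tuple(v + (k == i) + (k == j) for k, v in enumerate(x))
                if f(x_i) + f(x_j) < f(x) + f(x_ij) - tol:
                    return False
    return True

def is_l_natural_convex(f, box, zeta_range=(0, 3)):
    """Check that psi(x, zeta) = f(x - zeta * 1) is submodular in (x, zeta)."""
    def psi(z):
        *x, zeta = z
        return f(tuple(v - zeta for v in x))
    return is_submodular(psi, list(box) + [zeta_range])

box = [(0, 5), (0, 5)]
sq_diff = lambda p: (p[0] - p[1]) ** 2     # convex function of a difference
neg_prod = lambda p: -p[0] * p[1]          # submodular, but not L-natural-convex
```

On this grid `sq_diff` passes both checks, while `neg_prod` passes only the submodularity check, which illustrates that L♮-convexity is the stricter property.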


In monotone comparative statics⁴, it is proved that minimizing a submodular or L♮-convex function
results in a monotonic optimal solution, which we summarize in terms of function $Q$ in the following
two lemmas.

Lemma 3.3: If $Q(x, a)$ is submodular in $(x, a)$, $V(x) = \min_{a \in \mathcal{A}} Q(x, a)$ is submodular
in $x$ and $a^*(x) = \arg\min_{a \in \mathcal{A}} Q(x, a)$ is nondecreasing in $x$.

Proof: This lemma is due to the properties of submodular functions [25]: if $f(x, y)$ is submodular
in $(x, y)$, $f^*(x) = \min_y f(x, y)$ is submodular in $x$, and $y^*(x) = \arg\min_y f(x, y)$ is
nondecreasing in $x$. ■


³It is proved in [11] that the sequence $\{V^{(n)}(x)\}$ generated by (11) converges to $V^*(x)$ for all
$x$, where $V^*(x)$ is the minimum and
$\theta^*(x) = \arg\min_{a \in \mathcal{A}} \{c(x, a) + \beta \sum_{x'} P^a_{xx'} V^*(x')\}$ is the minimizer
of (9). Usually, a small threshold $\epsilon > 0$ is applied so that (11) is terminated when
$\|V^{(n)}(x) - V^{(n-1)}(x)\| \leq \epsilon$ for all $x$. In this paper, we set $\epsilon = 10^{\cdots}$.

⁴Monotone comparative statics studies the situation that the optimal solution varies monotonically with
the system parameters

(Ml 
Lemma 3.4: If $Q(x, a)$ is L♮-convex in $(x, a)$, $V(x) = \min_{a \in \mathcal{A}} Q(x, a)$ is L♮-convex in
$x$ and $a^*(x) = \arg\min_{a \in \mathcal{A}} Q(x, a)$ is nondecreasing in $x$. In addition,
$a^*(x + 1) \leq a^*(x) + 1$ for all $x$.

Proof: This lemma is due to the properties of L♮-convex functions [26]: if $f(x, y)$ is L♮-convex in
$(x, y)$, $f^*(x) = \min_y f(x, y)$ is L♮-convex in $x$, and $y^*(x) = \arg\min_y f(x, y)$ is nondecreasing
in $x$ and $y^*(x + 1) \leq y^*(x) + 1$. ■

Remark 3.5: L♮-convexity differs from submodularity in that the increment of the resulting optimizer
$a^*$ from $x$ to $x + 1$ is bounded by 1. This is called the bounded marginal effect [27].
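The difference described in Remark 3.5 can be seen on two hypothetical toy functions, sketched below. The function `q_lnat` is L♮-convex in $(x, a)$, while `q_sub` is submodular but not L♮-convex; both functions and the helper `argmin_policy` are assumptions made only for illustration.

```python
def argmin_policy(Q, xs, actions):
    """Return the smallest minimizer a*(x) of Q(x, a) for each x in xs."""
    return [min(actions, key=lambda a: Q(x, a)) for x in xs]

# L-natural-convex toy function: convex in the difference x - a plus a linear term.
q_lnat = lambda x, a: (x - a) ** 2 + 0.5 * a
# Submodular toy function that is not L-natural-convex.
q_sub = lambda x, a: (2 * x - a) ** 2

pol_lnat = argmin_policy(q_lnat, range(6), range(6))
pol_sub = argmin_policy(q_sub, range(4), range(7))

# Increments of the minimizer between adjacent values of x.
steps_lnat = [b - a for a, b in zip(pol_lnat, pol_lnat[1:])]
steps_sub = [b - a for a, b in zip(pol_sub, pol_sub[1:])]
```

Both minimizers are nondecreasing, but only the one obtained from `q_lnat` moves by at most one per step; the one obtained from `q_sub` jumps by two.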

In this paper, the idea for proving the existence of a monotonic optimal policy is to show that the
L♮-convexity or submodularity is preserved by the minimization operation in each iteration of DP. The
related results will be derived in Proposition 3.6 in Section III-A and Proposition 3.9 in Section III-B.

In the remainder of this paper, when we say that a function $f(x, y)$ has some property in $x$, we mean
that $f(x, y)$ has this property in $x$ for every fixed value of $y$. For example, if $f(x, y)$ is
nondecreasing in $x$, then $f(x^+, y) \geq f(x^-, y)$ for all $y$ if $x^+ \geq x^-$.

A. Nondecreasing Optimal Policy in Queue State

Based on Lemma 3.4, we show that the optimal transmission policy is always nondecreasing in queue
state.

Proposition 3.6: For $x = (b, h)$ and $x' = (b', h')$, if $Q(x, a)$ is L♮-convex in $(b, a)$ and
nondecreasing in $b$ for all $V(x')$ that is nondecreasing and L♮-convex in $b'$, the optimal policy
$\theta^*(x)$ is nondecreasing in $b$, and $\theta^*(b + 1, h) \leq \theta^*(b, h) + 1$ for all $(b, h)$.

Proof: Assume $V^{(n-1)}(x')$ is nondecreasing and L♮-convex in $b'$. Then,
$Q(x, a) = c(x, a) + \beta \sum_{x'} P^a_{xx'} V^{(n-1)}(x')$ is L♮-convex in $(b, a)$ and nondecreasing in
$b$. According to Lemma 3.4, $V^{(n)}(x) = \min_{a \in \mathcal{A}} Q(x, a)$ is L♮-convex in $b$. Let
$a^{(n)}(b, h) = \arg\min_{a \in \mathcal{A}} Q(b, h, a)$. Since

\[
\begin{aligned}
V^{(n)}(b + 1, h) - V^{(n)}(b, h)
&= Q(b + 1, h, a^{(n)}(b + 1, h)) - Q(b, h, a^{(n)}(b, h)) \\
&\geq Q(b + 1, h, a^{(n)}(b + 1, h)) - Q(b, h, a^{(n)}(b + 1, h)) \geq 0,
\end{aligned}
\]

$V^{(n)}(x)$ is also nondecreasing in $b$. Let DP start with $V^{(0)}(x)$ that is nondecreasing and
L♮-convex in $b$, e.g., $V^{(0)}(x) = 0$ for all $x$. Then, by induction, DP terminates at the $N$th
iteration with $c(x, a) + \beta \sum_{x'} P^a_{xx'} V^{(N)}(x')$ L♮-convex in $(b, a)$. According to
Lemma 3.4, the optimal policy $\theta^*(x)$ determined by (13) is nondecreasing in $b$ and
$\theta^*(b + 1, h) \leq \theta^*(b, h) + 1$ for all $(b, h)$. ■

Theorem 3.7: The optimal policy $\theta^*(x)$ is nondecreasing in $b$ and
$\theta^*(b + 1, h) \leq \theta^*(b, h) + 1$.


August 25, 2015 


DRAFT 






IEEE TRANSACTIONS ON COMMUNICATIONS 


II 


Proof: According to (1), the queue state at the next decision epoch $b'$ can be expressed by the
queue state at the current decision epoch $b$ by $b' = \min\{[b - a]^+ + f, L_B\}$. The $Q$ function in
(12) can be rewritten as

\[
\begin{aligned}
Q(b, h, a) &= c(b, h, a) + \beta \sum_{h'} \sum_{b'} P_{hh'} P^a_{bb'} V(b', h') \\
&= c_{tr}(h, a) + w \, \mathbb{E}_f\left[ \left[ [b - a]^+ + f - L_B \right]^+ \right]
 + \beta \sum_{h'} P_{hh'} \mathbb{E}_f\left[ V(\min\{[b - a]^+ + f, L_B\}, h') \right].
\end{aligned}
\]

Define $\psi_o(y, f) = \left[ [y]^+ + f - L_B \right]^+$ and
$\tilde{V}(y, f, h) = w \psi_o(y, f) + \beta V(\min\{[y]^+ + f, L_B\}, h)$. We can express $Q$ by

\[
Q(b, h, a) = c_{tr}(h, a) + \sum_{h'} P_{hh'} \mathbb{E}_f\left[ \tilde{V}(b - a, f, h') \right]. \tag{14}
\]

Then, $Q$ is nondecreasing in $b$ for all $V(b', h')$ that is nondecreasing in $b'$ (see proof in
Appendix B), and $Q$ is L♮-convex in $(b, a)$ for all $V(b', h')$ that is L♮-convex in $b'$ (see proof in
Appendix C). By Proposition 3.6, the theorem holds. ■

Remark 3.8: Theorem 3.7 holds unconditionally, i.e., the monotonicity of $\theta^*$ in queue state $b$ and
the bounded marginal effect $\theta^*(b + 1, h) \leq \theta^*(b, h) + 1$ for all $b$ always exist
regardless of the values of system parameters such as the weight factor $w$, the discount factor $\beta$
and the state transition probability.

B. Nondecreasing Optimal Policy in Queue and Channel States

Based on Lemma 3.3 and the results in Theorem 3.7, we derive the sufficient condition for the optimal
policy to be nondecreasing in both queue occupancy and channel states.

Proposition 3.9: If $Q(x, a)$ is submodular in $(x, a) = (b, h, a)$ for all $V(x')$ that is nondecreasing
and submodular in $x' = (b', h')$, the optimal policy $\theta^*(x)$ is nondecreasing in $x = (b, h)$.

Proof: By using Lemma 3.3, this proposition can be proved by following the same induction method
as in the proof of Proposition 3.6. ■

Theorem 3.10: If $P_{hh'}$ is first order stochastic nondecreasing⁵ in $h$ and

\[
w \leq c_{tr}(h + 1, a) + c_{tr}(h, a + 1) - c_{tr}(h, a) - c_{tr}(h + 1, a + 1) \tag{15}
\]

for all $(h, a)$, the optimal policy $\theta^*(x)$ is nondecreasing in $x = (b, h)$.

Proof: If $P_{hh'}$ is first order stochastic nondecreasing in $h$ and inequality (15) holds for all
$(h, a)$, we can prove that $Q$ is submodular in $(b, h, a)$ for all $V(x')$ that is submodular in
$x' = (b', h')$ (see proof in Appendix D). Therefore, by Proposition 3.9, the theorem holds. ■


⁵See Appendix A for the definition and explanation of first order stochastic dominance.




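In the following two corollaries, we show that Theorem 3.10 is in fact conditioned on the value of the
weight factor $w$ and channel statistics.

Corollary 3.11: If

\[
w \leq -\frac{2\ln(5P_b)}{1.5} \left( \frac{1}{\Gamma_h} - \frac{1}{\Gamma_{h+1}} \right)
\]

for all $h$, inequality (15) holds.

Proof: Since

\[
\begin{aligned}
c_{tr}(h + 1, a) + c_{tr}(h, a + 1) - c_{tr}(h, a) - c_{tr}(h + 1, a + 1)
&= -\frac{2^a \ln(5P_b)}{1.5} \left( \frac{1}{\Gamma_h} - \frac{1}{\Gamma_{h+1}} \right) \\
&\geq -\frac{2\ln(5P_b)}{1.5} \left( \frac{1}{\Gamma_h} - \frac{1}{\Gamma_{h+1}} \right),
\end{aligned}
\tag{16}
\]

inequality (15) holds if
$w \leq -\frac{2\ln(5P_b)}{1.5} \left( \frac{1}{\Gamma_h} - \frac{1}{\Gamma_{h+1}} \right)$
holds for all $h$. ■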

The condition that $P_{hh'}$ is first order stochastic nondecreasing in $h$ in Theorem 3.10 is not hard to
satisfy. The following corollary shows that it holds when the channel experiences slow and flat fading
with respect to the duration of decision epoch $T_D$. Here, slow means that the normalized Doppler
frequency shift $f_D T_D \leq 0.01$, where $f_D$ is the maximum Doppler shift.

Corollary 3.12: If the channel experiences slow and flat fading with respect to decision duration $T_D$,
the channel transition probability $P_{hh'}$ is first order stochastic nondecreasing in $h$.

Proof: Because the fading is slow and flat, the channel transitions can be worked out by the level
crossing rate (LCR) [10] and only happen between adjacent states, i.e., $h' \in \{h - 1, h, h + 1\}$. And,
$P_{hh'} = P_{h'h}$ and $P_{hh'} \ll P_{hh}$ for all $h' \neq h$. According to Definition A.1, for
nondecreasing $u$, $P_{hh'}$ is first order stochastic nondecreasing in $h$ because

\[
\sum_{(h+1)'} P_{(h+1)(h+1)'} u((h + 1)') - \sum_{h'} P_{hh'} u(h')
\geq (1 - 2P_{h(h+1)}) \left[ u(h + 1) - u(h) \right] \geq 0, \tag{17}
\]

where $1 - 2P_{h(h+1)} \geq 0$ because $P_{hh'} \leq P_{hh}$ and $\sum_{h'} P_{hh'} = 1$. ■
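The condition of Corollary 3.12 can be checked numerically for a given transition matrix. The sketch below uses the standard tail-sum characterization of first order stochastic dominance, which is assumed here to match Definition A.1 since the appendix is not reproduced in this section; the function name and the two example matrices are hypothetical.

```python
def is_stochastically_nondecreasing(P, tol=1e-12):
    """Check that row h+1 of P first-order stochastically dominates row h.

    Equivalent test: for every k, the tail sum sum_{h' >= k} P[h][h'] is
    nondecreasing in h.
    """
    K = len(P)
    for k in range(K):
        tails = [sum(row[k:]) for row in P]
        if any(tails[h + 1] < tails[h] - tol for h in range(K - 1)):
            return False
    return True

# Hypothetical slow-fading channel: symmetric, transitions only between neighbors.
P_slow = [
    [0.95, 0.05, 0.00, 0.00],
    [0.05, 0.90, 0.05, 0.00],
    [0.00, 0.05, 0.90, 0.05],
    [0.00, 0.00, 0.05, 0.95],
]
# Same matrix, except that the best state always jumps back to the worst one.
P_jump = [row[:] for row in P_slow[:3]] + [[1.0, 0.0, 0.0, 0.0]]
```

The second matrix mimics the kind of modification used in the examples of Section III-C to breach the stochastic dominance condition.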


C. Examples 

We construct an adaptive m-QAM system as in Fig. 1. We assume that the decision rate is $10^3$
decisions/sec, i.e., the duration of each decision epoch is $T_D = 10^{-3}$ second. We set queue length
$L_B = 15$, the maximum action $A_m = 5$ and the BER constraint $P_b = 10^{-3}$. The number of packets
arrived is Poisson distributed: $f^{(t)} \sim \text{Pois}(3)$ for all $t$. The optimal policy $\theta^*$ is
searched by DP with a discount factor $\beta = 0.95$. We vary the system parameters to show the optimal
transmission policies as follows.

Fig. 3. The optimal policy $\theta^*$ in a 16-queue state 8-channel state cross-layer adaptive m-QAM system
as shown in Fig. 1, where BER constraint $P_b = 10^{-3}$, weight factor $w = 1$. The channel experiences
slow and flat Rayleigh fading with average SNR being 0 dB and maximum Doppler shift being 10 Hz. In this
system, both Theorems 3.7 and 3.10 hold. $\theta^*$ is nondecreasing in queue state $b$ and channel state
$h$. Since the monotonicity in $b$ is established by L♮-convexity, the increment of $\theta^*$ in $b$ is
restricted by a bounded marginal effect, i.e., $\theta^*(b + 1, h) \leq \theta^*(b, h) + 1$ for all $(b, h)$.

Assume the channel experiences slow and flat Rayleigh fading. Let the average SNR be 0 dB and the
maximum Doppler shift be 10 Hz (so that the normalized Doppler frequency shift is $f_D T_D = 0.01$). We
model the channel by an 8-state FSMC by using the equiprobable SNR partition method [10]. We first set
$w = 1$. In this case, Theorem 3.7 holds. By working out the SNR boundaries by the FSMC method
described in [10], it can be shown that Corollaries 3.11 and 3.12 are satisfied. Therefore, Theorem 3.10
also holds. As shown in Fig. 3, $\theta^*$ is nondecreasing in both $b$ and $h$, and the increment of
$\theta^*$ from $b$ to $b + 1$ for any fixed channel state $h$ is bounded by 1. From Fig. 3, we can also see
the differences between L♮-convexity and submodularity in terms of the resulting optimal policy: since
the monotonicity of $\theta^*$ in $b$ is due to the L♮-convexity, the increment of $\theta^*$ from $b$ to
$b + 1$ is no greater than 1; since the monotonicity of $\theta^*$ in $h$ is due to the submodularity
instead of L♮-convexity, the increment of $\theta^*$ from $h$ to $h + 1$ may exceed 1, e.g., when $b = 6$,
the increment of $\theta^*$ from $h = 7$ to $h = 8$ is 2.

We then show examples in which the monotonicity of $\theta^*$ in $h$ is not guaranteed if either
condition in Theorem 3.10 is breached. We first change $w$ to 400 to breach the condition (15). The
optimal policy is shown in Fig. 4. We then set $w$ back to 1 and change the channel transition
probability as $\Pr(h' \mid h = 7) = 0$ for all $h'$ except $\Pr(h' = 8 \mid h = 7) = 1$ and
$\Pr(h' \mid h = 8) = 0$ for all $h'$ except $\Pr(h' = 1 \mid h = 8) = 1$. The purpose is to satisfy (15)
but breach the stochastic dominance of $P_{hh'}$. The optimal policy is shown in Fig. 5. It can be seen
from Figs. 4 and 5 that $\theta^*$ is not nondecreasing in $h$ for all $b$. But, since Theorem 3.7 holds
unconditionally, $\theta^*(b, h) \leq \theta^*(b + 1, h) \leq \theta^*(b, h) + 1$ for all $(b, h)$ in
Figs. 4 and 5.

Fig. 4. The optimal policy $\theta^*$ in a 16-queue state 8-channel state cross-layer adaptive m-QAM system
as shown in Fig. 1, where BER constraint $P_b = 10^{-3}$, weight factor $w = 400$. The channel experiences
slow and flat Rayleigh fading with average SNR being 0 dB and maximum Doppler shift being 10 Hz. In this
system, Theorem 3.10 does not hold. $\theta^*$ is not nondecreasing in $h$ for all $b$, e.g.,
$\theta^*(b, h + 1) < \theta^*(b, h)$ when $b = 3$ and $h = 2$.

Fig. 5. The optimal policy $\theta^*$ in a 16-queue state 8-channel state cross-layer adaptive modulation
system as shown in Fig. 1, where BER constraint $P_b = 10^{-3}$, weight factor $w = 1$. But, the channel
transition probability is not first order stochastic nondecreasing, i.e., Theorem 3.10 does not hold.
Therefore, $\theta^*$ is not nondecreasing in $h$ for all $b$, e.g., $\theta^*(b, h + 1) < \theta^*(b, h)$
when $b = 2$ and $h = 5$.

IV. MoNOTONic Policy Iteration 

Consider the DP algorithm in (11). In each iteration, a minimization is performed for each $\mathbf{x}$ in the system state space $\mathcal{X}$; in each minimization, the value of $Q$ is calculated for each $a$ in $\mathcal{A}$, and obtaining each value of $Q$ requires multiplications over all values of $\mathbf{x}' \in \mathcal{X}$. The time complexity of each DP iteration is therefore $O(|\mathcal{X}|^2|\mathcal{A}|)$. Since $|\mathcal{X}| = |\mathcal{B}||\mathcal{H}|$, the complexity grows quadratically if the cardinality of any tuple in the state variable increases. If the system in Fig. 1 is extended to a multi-user or multi-channel one, the time complexity of DP may grow exponentially with both the number of users and the number of channels. For example, if the wireless channel in Fig. 1 is a MIMO (multiple-input and multiple-output) one that contains $m$ subchannels, then $|\mathcal{X}| = |\mathcal{B}||\mathcal{H}|^m$, which means the time complexity of DP grows exponentially with $m$. In this and the next section, we discuss how to utilize the monotonicity results derived in Section III to relieve the computational complexity of DP. For this purpose, we first propose an MPI algorithm in this section, and then discuss in Section V how to convert (9) to a discrete minimization problem and apply a stochastic approximation algorithm.

MPI is a modified DP algorithm that was first introduced in [11], [13] based on the submodularity of DP. The idea is to modify the DP function in (11) as

\[ V(\mathbf{x}) := \min_{a \in \mathcal{A}(\mathbf{x})} Q(\mathbf{x}, a), \quad \forall \mathbf{x} \in \mathcal{X}, \tag{18} \]

where $\mathcal{A}(\mathbf{x})$ is a set-valued selection that depends on state $\mathbf{x}$ and is defined as follows.

Let $\theta(\mathbf{x}) = \arg\min_{a \in \mathcal{A}} Q(\mathbf{x}, a)$. If $\theta$ is nondecreasing in $b$ (e.g., due to the submodularity of $Q$), instead of searching the whole action space $\mathcal{A}$ to get $V(\mathbf{x})$, we only need to consider those actions that are no less than $\theta(b-1, h)$. Therefore, $\mathcal{A}(\mathbf{x})$ is defined as

\[ \mathcal{A}(\mathbf{x}) = \mathcal{A}(b, h) = \{a \colon \theta(b-1, h) \le a \le A_m\}. \]

Note that $\mathcal{A}(0, h) = \mathcal{A}$, and (18) should be applied in increasing order of $b$ in each iteration so that $|\mathcal{A}(\mathbf{x})|$ is progressively reduced. MPI and DP converge at the same rate, but the complexity of each iteration is $O(|\mathcal{X}|^2|\mathcal{A}|)$ for DP and $O(|\mathcal{X}|^2|\mathcal{A}(\mathbf{x})|)$ for MPI. Since $|\mathcal{A}(\mathbf{x})| \le |\mathcal{A}|$, the computational load of MPI is no greater than that of DP.
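The restricted search can be sketched as follows. This is only an illustration under stated assumptions: the values $Q(\mathbf{x}, a)$ of the current iteration are assumed to be available as a table `Q[b, h, a]` (a hypothetical layout; in the DP they would be recomputed from the previous value function), and the name `mpi_sweep` and the lookup counter are not from the paper.

```python
import numpy as np

def mpi_sweep(Q):
    """One sweep of (18) with A(b, h) = {a : theta(b-1, h) <= a <= A_m}.

    Q is a hypothetical table of shape (n_b, n_h, n_a) holding Q(b, h, a).
    Returns the greedy policy, the minimized values and the number of Q lookups.
    """
    n_b, n_h, n_a = Q.shape
    theta = np.zeros((n_b, n_h), dtype=int)
    V = np.zeros((n_b, n_h))
    evals = 0
    for h in range(n_h):
        lowest = 0  # A(0, h) is the whole action set
        for b in range(n_b):  # increasing order of b, as required
            candidates = np.arange(lowest, n_a)
            evals += candidates.size
            best = candidates[np.argmin(Q[b, h, candidates])]
            theta[b, h], V[b, h] = best, Q[b, h, best]
            lowest = best  # state b + 1 only searches actions >= theta(b, h)
    return theta, V, evals

# Toy table Q(b, h, a) = 0.3 a^2 + (b - a)^2, which is submodular in (b, a).
b, a = np.meshgrid(np.arange(6), np.arange(4), indexing="ij")
Q = np.repeat((0.3 * a**2 + (b - a) ** 2)[:, None, :], 2, axis=1)
theta, V, evals = mpi_sweep(Q)
print(theta[:, 0], evals, "lookups instead of", Q.size)
```

On this toy table the restricted sweep returns the same policy as an exhaustive search while touching fewer entries of `Q`.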

In the MDP model considered in this paper, the complexity can be reduced further. Since Theorem 3.7 holds unconditionally, $Q$ is $L^{\natural}$-convex in $(b, a)$, and the increment of $\theta$ in $b$ is restricted by a bounded marginal effect, i.e., $\theta(b, h)$ must be either $\theta(b-1, h)$ or $\theta(b-1, h) + 1$. Therefore, we can define $\mathcal{A}(\mathbf{x})$ as

\[ \mathcal{A}(\mathbf{x}) = \{\theta(b-1, h),\ \theta(b-1, h) + 1\}. \]

August 25, 2015

DRAFT

IEEE TRANSACTIONS ON COMMUNICATIONS

Fig. 6. The time complexity of DP, MPI based on submodularity and MPI based on $L^{\natural}$-convexity in terms of the average number of calculations of $Q$ per iteration. The system settings are the same as in Fig. 4 except that the number of channel states in the FSMC is varied from 2 to 10.

Therefore, the time complexity of the MPI algorithm based on the $L^{\natural}$-convexity of DP can be reduced to $O(|\mathcal{X}|^2)$, because at most two actions are evaluated in each state with $b > 0$. We use the system settings of Fig. 4 and show the complexity of DP, MPI based on submodularity and MPI based on $L^{\natural}$-convexity by varying the number of channel states $|\mathcal{H}|$ in the FSMC from 2 to 10. The results are shown in Fig. 6, where the time complexity is measured as the number of calculations of $Q$ averaged over iterations. It can be seen that the complexity of both MPI algorithms is less than that of DP. In addition, the complexity of the MPI algorithm based on $L^{\natural}$-convexity is much lower than that of the one based on submodularity.
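A minimal sketch of the two-candidate search implied by the bounded marginal effect is given below. It assumes, as an illustration only, that $Q$ is available as a table `Q[b, h, a]`; the toy table has the form $c(a) + g(b - a)$ with convex $c$ and $g$, which mirrors the structure used in Appendix C, and the function name is hypothetical.

```python
import numpy as np

def mpi_sweep_bounded(Q):
    """Sweep with A(b, h) = {theta(b-1, h), theta(b-1, h) + 1} for b > 0."""
    n_b, n_h, n_a = Q.shape
    theta = np.zeros((n_b, n_h), dtype=int)
    evals = 0
    for h in range(n_h):
        theta[0, h] = int(np.argmin(Q[0, h]))  # b = 0: search the whole action set
        evals += n_a
        for b in range(1, n_b):
            prev = int(theta[b - 1, h])
            # keep the previous action or move up by one, staying inside the action set
            candidates = [a for a in (prev, prev + 1) if a < n_a]
            evals += len(candidates)
            theta[b, h] = min(candidates, key=lambda a: Q[b, h, a])
    return theta, evals

# Toy table Q(b, h, a) = 0.3 a^2 + (b - a)^2 (convex cost in a plus convex function of b - a).
b, a = np.meshgrid(np.arange(6), np.arange(4), indexing="ij")
Q = np.repeat((0.3 * a**2 + (b - a) ** 2)[:, None, :], 2, axis=1)
theta, evals = mpi_sweep_bounded(Q)
print(theta[:, 0], evals, "lookups instead of", Q.size)
```

Per channel state the sweep needs at most $|\mathcal{A}| + 2(|\mathcal{B}| - 1)$ lookups, compared with $|\mathcal{A}||\mathcal{B}|$ for an exhaustive search.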

V. Discrete Stochastic Approximation 

This section considers using a simulation-based algorithm to relieve the complexity of DP. The idea is to convert (9) to a minimization problem over queue thresholds and use a stochastic approximation algorithm to search for the optimizer. Stochastic approximation algorithms have been used in other cross-layer adaptive modulation systems before. For example, it is shown in [13] that the SPSA algorithm is able to learn the optimal randomized policy in an m-QAM congestion game. In this section, we show that a stochastic approximation algorithm can also be used to search for the optimal deterministic policy in the adaptive m-QAM system in Fig. 1.

A. Constrained Multivariate Minimization


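Based on Assumption 2.3, $|\mathcal{A}| < |\mathcal{B}|$, i.e., the cardinality of the action set $\mathcal{A}$ is less than that of the queue state set $\mathcal{B}$. Since the optimal policy is always nondecreasing in queue state $b$ (Theorem 3.7), $\theta^*$ can be expressed by

\[ \theta^*(\mathbf{x}) = \begin{cases} A_m & \phi^*_{hA_m} \le b \le L_B \\ \ \vdots & \quad \vdots \\ 1 & \phi^*_{h1} \le b < \phi^*_{h2} \\ 0 & 0 \le b < \phi^*_{h1} \end{cases} \tag{19} \]

Let $i \in \{1, \dots, A_m\}$. $\phi^*_{hi}$ is the optimal queue threshold at which $\theta^*$ switches from action $i-1$ to $i$ in channel state $h$. Define $\boldsymbol{\phi}_h = (\phi_{h1}, \dots, \phi_{hA_m})$, which contains a set of queue thresholds that is sufficient to describe a monotonic policy for all $b$ for a certain value of $h$. Construct a queue threshold vector as $\boldsymbol{\phi} = (\boldsymbol{\phi}_1, \boldsymbol{\phi}_2, \dots, \boldsymbol{\phi}_{|\mathcal{H}|})$. $\boldsymbol{\phi}$ contains all queue thresholds that are sufficient to describe a policy $\theta_{\text{mono}}$ that is nondecreasing in $b$ by

\[ \theta_{\text{mono}}(\mathbf{x}) = \begin{cases} 0 & \{i \colon b \ge \phi_{hi}\} = \emptyset \\ \max\{i \colon b \ge \phi_{hi}\} & \text{otherwise} \end{cases} \tag{20} \]

By doing so, (9) can be converted to a constrained multivariate minimization problem as follows.

Theorem 5.1: The optimization problem (9) is equivalent to

\[ \min_{\boldsymbol{\phi} \in \Phi} J(\boldsymbol{\phi}) \quad \text{s.t.} \quad \phi_{hi} - \phi_{h(i+1)} \le 0, \tag{21} \]

where $\Phi = \{0, 1, \dots, L_B + 1\}^D$ with $D = |\mathcal{H}|A_m$, and

\[ J(\boldsymbol{\phi}) = \sum_{\mathbf{x}^{(0)} \in \mathcal{X}} E\left[ \sum_{t=0}^{\infty} \beta^t c\big(\mathbf{x}^{(t)}, \theta_{\text{mono}}(\mathbf{x}^{(t)})\big) \right]. \tag{22} \]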


Proof: Let the set $\Theta_{\text{mono}}$ contain all the deterministic stationary policies that are nondecreasing in queue state $b$. According to (8), $J(\boldsymbol{\phi}) = \sum_{\mathbf{x}} V_{\theta_{\text{mono}}}(\mathbf{x})$, where $\theta_{\text{mono}} \in \Theta_{\text{mono}}$ is determined by $\boldsymbol{\phi}$ via (20). Then, (21) is in fact the problem

\[ \min_{\theta_{\text{mono}} \in \Theta_{\text{mono}}} \sum_{\mathbf{x}} V_{\theta_{\text{mono}}}(\mathbf{x}). \tag{23} \]

Since there always exists an optimal policy $\theta^*$ that is nondecreasing in $b$ (Theorem 3.7), $\theta^* \in \Theta_{\text{mono}}$. Therefore, (9) is equivalent to (21). ∎

Fig. 7. The optimal queue threshold vector $\boldsymbol{\phi}^*$ extracted from the optimal transmission policy in Fig.[^

Fig. 8. The optimal queue threshold

Remark 5.2: Since the objective function $J$ is an expectation and $\boldsymbol{\phi}$ only takes integer values, (21) is a discrete stochastic minimization problem with inequality constraints.

Remark 5.3: The constraints in (21) are due to the monotonicity of $\theta_{\text{mono}}$ in $b$. Given $\theta_{\text{mono}} \in \Theta_{\text{mono}}$, $\phi_{hi}$ is determined as

\[ \phi_{hi} = \min\{b \colon \theta_{\text{mono}}(b, h) = i\}. \tag{24} \]

Since $\theta_{\text{mono}}$ is nondecreasing in $b$, the queue thresholds should satisfy $\phi_{h1} \le \phi_{h2} \le \dots \le \phi_{hA_m}$. See examples in Figs. 7 and 8.
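The mapping between a threshold vector and a monotonic policy for one channel state can be sketched as below. The function names are hypothetical. The inverse map uses $\min\{b \colon \theta(b) \ge i\}$, which agrees with (24) whenever no action is skipped, and it returns $L_B + 1$ for an action that is never reached, which is an assumption made here for illustration.

```python
def policy_from_thresholds(phi_h, L_B):
    """Eq. (20) for one channel state: theta(b) = max{i : b >= phi_hi}, or 0 if the set is empty."""
    policy = []
    for b in range(L_B + 1):
        reached = [i for i, p in enumerate(phi_h, start=1) if b >= p]
        policy.append(max(reached, default=0))
    return policy

def thresholds_from_policy(theta_h, A_m):
    """Smallest b with theta(b) >= i for each action i; L_B + 1 if never reached (assumption)."""
    L_B = len(theta_h) - 1
    return [next((b for b, a in enumerate(theta_h) if a >= i), L_B + 1)
            for i in range(1, A_m + 1)]

phi_h = [2, 5, 5, 16]  # thresholds for actions 1..4 with L_B = 15; action 4 is never used
theta_h = policy_from_thresholds(phi_h, L_B=15)
print(theta_h)
print(thresholds_from_policy(theta_h, A_m=4))
```

The example thresholds satisfy the ordering constraint in (21), and mapping to a policy and back recovers them.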


B. Discrete Simultaneous Perturbation Stochastic Approximation 

Consider using a stochastic approximation algorithm to solve problem (21). We present a DSPSA algorithm in Algorithm 1. This algorithm was first proposed in [28]. It uses gradient-based line search iterations and the augmented Lagrangian method to solve an inequality constrained stochastic minimization problem. It produces an estimation sequence $\{\boldsymbol{\phi}^{(n)}\}$ of the minimizer with $\boldsymbol{\phi}^{(n)} \in [0, L_B + 1]^D$.

Augmented Lagrangian is a combination of penalty and Lagrangian methods for solving constrained minimization problems. It was suggested in [29] to prevent the situation in which the penalty coefficient goes to infinity with the iteration index, as in the quadratic penalty method. For more details on augmented Lagrangian, we refer the reader to (H.

Algorithm 1: DSPSA

input : initial guess $\boldsymbol{\phi}^{(0)}$ (a $D$-tuple with $D = |\mathcal{H}|A_m$), total number of iterations $N$, step size parameters $A$, $B$, $a_1$ and $a_2$, and the penalty coefficient $R$

output:

begin
    set Lagrangian multiplier $\lambda_{hi} = 0$ for all $h$ and $i$;
    for n = 1 to N do
        obtain $\hat{g}$ at $\boldsymbol{\phi}^{(n)}$ by using the simulated objective function $\hat{J}$;
        update estimation by
            max{0,A|” -f ;
            h i
        update Lagrangian multiplier by
            = max|o,AE^^
        for all $h$ and $i$;
    endfor
end

In Algorithm 1, $v_{hi}$ is the constraint function in (21), i.e.,

\[ v_{hi}(\boldsymbol{\phi}) = \phi_{hi} - \phi_{h(i+1)}, \tag{25} \]

and $\Pi_{\Phi}(\boldsymbol{\phi})$ is a projection function that returns a closest integer point (by Euclidean distance) in $\Phi$ to $\boldsymbol{\phi}$. The implementation details of Algorithm 1 are described as follows.

1) Obtaining $\hat{g}$: Since (21) is a discrete optimization problem, we use the gradient calculation method based on discrete midpoint convexity in [11]. The method is to generate $\Delta = (\Delta_1, \dots, \Delta_D)$, where the entries $\Delta_d \in \{-1, 1\}$ are independent Bernoulli random variables that take each value with probability 0.5. The $d$th entry of $\hat{g}(\boldsymbol{\phi}^{(n)})$ is obtained by

+ LA) - A.-‘. (26)

Fig. 9. Convergence performance of DSPSA when $L_B = 15$, $A_m = 5$, $f^{(t)} \sim \text{Pois}(3)$ and $P_e$ = 10~®. The channel is Rayleigh fading with an average SNR of 0 dB and a maximum Doppler shift of 10 Hz. It is modeled by an 8-state FSMC. The weight factor is $w = 100$.
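For the gradient estimate in step 1), the sketch below shows the form that is commonly used for DSPSA: the noisy objective is evaluated at the two integer points $\lfloor \boldsymbol{\phi} \rfloor + \tfrac{1}{2}\mathbf{1} \pm \tfrac{1}{2}\Delta$ and the difference is divided entrywise by $\Delta$. This is an assumption; it is not necessarily term-by-term identical to (26), and `dspsa_gradient` and `J_toy` are hypothetical names.

```python
import numpy as np

def dspsa_gradient(J_hat, phi, rng):
    """Two-measurement gradient estimate with a random +/-1 perturbation (assumed form)."""
    delta = rng.choice([-1, 1], size=phi.shape)  # each entry is -1 or 1 with probability 0.5
    mid = np.floor(phi) + 0.5                    # midpoint of the unit cell containing phi
    plus = (mid + delta / 2).astype(int)         # integer vertex of that cell
    minus = (mid - delta / 2).astype(int)        # the opposite vertex
    return (J_hat(plus) - J_hat(minus)) / delta  # only two evaluations of J_hat

rng = np.random.default_rng(0)
J_toy = lambda p: float(np.sum((p - 3) ** 2))    # toy objective, not the paper's J
print(dspsa_gradient(J_toy, np.array([0.2]), rng))
```

In one dimension the estimate equals the forward difference of the toy objective between the two neighbouring integers, whichever sign $\Delta$ takes.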


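2) Obtaining $\hat{J}$: $\hat{J}$ is the noisy measurement of the objective function $J$. The method of obtaining $\hat{J}(\boldsymbol{\phi})$ is to simulate the sequence $\{\mathbf{x}^{(t)}\}$. Here, $\{\mathbf{x}^{(t)}\}$ is governed by the Markov chain with state transition probability $\Pr(\mathbf{x}^{(t+1)} \mid \mathbf{x}^{(t)}, \theta_{\text{mono}}(\mathbf{x}^{(t)}))$, where $\theta_{\text{mono}}(\mathbf{x})$ is determined by $\boldsymbol{\phi}$ via (20). We obtain $\hat{J}$ as

\[ \hat{J}(\boldsymbol{\phi}) = \sum_{\mathbf{x}^{(0)} \in \mathcal{X}} \sum_{t=0}^{T} \beta^t c\big(\mathbf{x}^{(t)}, \theta_{\text{mono}}(\mathbf{x}^{(t)})\big). \tag{27} \]

$T$ is the simulation length and depends on $\beta$, i.e., the simulation stops when the increments over several successive decision epochs are below a small threshold (10“^).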
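A minimal sketch of the simulated objective in (27) follows. The callables `policy`, `step` and `cost`, the tolerance `tol`, the count `patience` and the cap `max_T` are all hypothetical placeholders for illustration; in particular the tolerance value is an assumption and is not taken from the paper.

```python
import numpy as np

def simulate_J(states, policy, step, cost, beta, rng, tol=1e-3, patience=5, max_T=10_000):
    """Sum over initial states of one simulated discounted cost trajectory, as in (27)."""
    total = 0.0
    for x0 in states:
        x, discount, small = x0, 1.0, 0
        for _ in range(max_T):
            a = policy(x)
            increment = discount * cost(x, a)
            total += increment
            # stop once the increments stay below tol for `patience` successive epochs
            small = small + 1 if abs(increment) < tol else 0
            if small >= patience:
                break
            x = step(x, a, rng)
            discount *= beta
    return total

# Toy check: unit cost in every epoch, so each trajectory sums to roughly 1 / (1 - beta).
rng = np.random.default_rng(0)
J = simulate_J([0, 1], policy=lambda x: 0, step=lambda x, a, r: x,
               cost=lambda x, a: 1.0, beta=0.5, rng=rng)
print(J)
```

Because the increments shrink geometrically with $\beta$, the stopping rule ties the simulation length $T$ to the discount factor, as described above.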

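3) Obtaining $\nabla v_{hi}(\boldsymbol{\phi}^{(n)})$: $\nabla v_{hi}(\boldsymbol{\phi}^{(n)})$ is the gradient of the constraint function $v_{hi}$ at $\boldsymbol{\phi}^{(n)}$. Since $v_{hi}$ is linear, $\nabla v_{hi}(\boldsymbol{\phi}^{(n)})$ is simply the coefficients in $v_{hi}$.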
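To make step 3) concrete, the sketch below evaluates the constraint functions in (25) and their constant gradients for a threshold array of shape $(|\mathcal{H}|, A_m)$. The array layout and the function name are assumptions for illustration, and indices are zero-based in the code.

```python
import numpy as np

def constraint_values_and_gradients(phi):
    """Return v[h, i] = phi[h, i] - phi[h, i + 1] and the gradient of each v[h, i] w.r.t. phi."""
    n_h, A_m = phi.shape
    v = phi[:, :-1] - phi[:, 1:]
    grad = np.zeros((n_h, A_m - 1, n_h, A_m))
    for h in range(n_h):
        for i in range(A_m - 1):
            grad[h, i, h, i] = 1.0       # coefficient of phi[h, i]
            grad[h, i, h, i + 1] = -1.0  # coefficient of phi[h, i + 1]
    return v, grad

phi = np.array([[2, 5, 5], [1, 1, 4]])
v, grad = constraint_values_and_gradients(phi)
print(v)           # all entries <= 0, so phi is feasible for (21)
print(grad[0, 0])  # gradient of the first constraint
```

The gradients do not depend on `phi`, which is why they only need to be formed once.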

4) Step Size Parameters and Penalty Coefficient: The step size parameters $A$, $B$, $a_1$ and $a_2$, and the penalty coefficient $R$ in Algorithm 1 are crucial for the convergence performance of DSPSA algorithms. In this paper, we adopt the method of choosing $A$, $B$, $a_1$, $a_2$ and $R$ suggested in [28], [30]: $A = 0.015$, $B = 100$, $a_1 = 0.602$, $a_2 = 0.1$ and $R = 10$. DSPSA always starts with $\boldsymbol{\phi}^{(0)} = \mathbf{0}$.

The authors in [30] presented an implementation guide for designers to choose the step size parameters when applying the simultaneous perturbation stochastic approximation (SPSA) method to practical problems. The experiments in subsequent works, e.g., [28], showed that this method could provide good convergence performance for SPSA algorithms.

Fig. 10. Convergence performance of DSPSA when the parameters are the same as in Fig. 9 except that $w = 400$.

[Panel labels: (a) the value of the objective function at the $n$th estimation; tuple of $\boldsymbol{\phi}^*$]

Fig. 11. Convergence performance of DSPSA when we set $w = 300$ for the first 5000 iterations and change to $w = 20$ for the second 5000 iterations. The other parameters are the same as in Fig. 9.

5) Complexity and Convergence Performance: One advantage of DSPSA is its low complexity: the estimation of $\hat{g}$ in each iteration requires only two simulations of the objective function. It is also proved that the estimation sequence generated by DSPSA is able to converge to the local minimizer for problem (21) probabilistically ESI. We run experiments to show the convergence performance of DSPSA. We set the duration of a decision epoch To = 10“^, the queue length $L_B = 15$, the maximum action $A_m = 5$, $f^{(t)} \sim \text{Pois}(3)$ and the same BER constraint $P_e$ as in Fig. 9. The channel is Rayleigh fading with an average SNR of 0 dB and a maximum Doppler shift of 10 Hz. It is modeled by an 8-state FSMC. We set the discount factor $\beta$ to 0.95 and the total number of iterations $N$ in DSPSA to 5000. We first choose $w = 100$ and apply DSPSA to search for the optimal threshold vector. The convergence performance is shown in Fig. 9. The optimal threshold vector $\boldsymbol{\phi}^*$ is determined by the optimal policy $\theta^*$ found by DP. We then set $w = 400$ and apply DSPSA again. The results are shown in Fig. 10. It can be seen that DSPSA converges to the optimum in both figures. Based on Figs. 9 and 10, DSPSA converges faster when $w = 400$ than when $w = 100$. We do not have a direct proof of the rate of convergence of DSPSA, but we provide two possible reasons why DSPSA converges faster with a higher value of $w$. One is the shape of the objective function $J$ in the neighbourhood of the local minimizer, since a study in [31] shows that stochastic steepest descent algorithms converge faster on average for strongly convex functions than for non-strongly convex functions. The other is the choice of step size parameters, which are important for the convergence performance of stochastic approximation algorithms [30]. In this paper, we follow the suggestions in [30] to set the values of the step size parameters, but there may exist a different set of step size parameters with which the convergence performance when $w = 100$ could be improved. In any case, Figs. 9 and 10 show that DSPSA is able to find an estimate of the optimal queue threshold vector at which the value of the objective function is very close to the optimum. How to speed up the DSPSA algorithm is beyond the scope of this paper and is left for future work.

The other advantage of DSPSA is that it does not require full knowledge of the MDP. Since DSPSA is a simulation-based algorithm, it can be implemented when only a simulation model is available. Therefore, DSPSA is suitable for real-time applications. Fig. 11 shows the convergence performance of DSPSA when we change the value of $w$. We use the same parameters as in Figs. 9 and 10. We apply DSPSA and change the value of $w$ from 300 to 20 at the 5000th iteration. It can be seen that DSPSA is able to adaptively track the optimum and the optimizer as the value of $w$ changes. The results also imply that DSPSA can be combined with model-free learning algorithms for the scheduler to learn the optimal transmission policy in real time.


VI. Conclusion 

We studied the monotonicity of the optimal policy in an MDP-modeled cross-layer adaptive m-QAM system. It was proved that the optimal policy is always nondecreasing in queue state due to the $L^{\natural}$-convexity of DP. By observing the submodularity of DP conditioned on the weight factor in the cost function and the channel statistics, we derived the sufficient conditions for the optimal policy to be nondecreasing in both queue and channel states. We showed that $L^{\natural}$-convexity differs from submodularity in that the variation of the resulting optimal policy is not only monotonic but also restricted by a bounded marginal effect. We presented two low complexity algorithms: MPI based on $L^{\natural}$-convexity and DSPSA. We showed that MPI based on $L^{\natural}$-convexity incurs much lower computational complexity than DP [11] and MPI based on submodularity [13]. For DSPSA, we ran numerical experiments to show its convergence performance, where we showed that it allows the decision maker to adaptively track the optimal policy.

It should be pointed out that the algorithms for finding the monotonic optimal policy in the cross-layer adaptive m-QAM system are not restricted to MPI and DSPSA. One can use the results in Section III to propose more efficient algorithms. For example, one may consider random search or simulated annealing algorithms for solving problem (21). This could be one direction of future research. In addition, Propositions 3.6 and 3.9 are not restricted to the expressions of $c_q$ and $c_{tr}$, i.e., they can be utilized to derive the monotonicity of the optimal policy in other queue-assisted cross-layer transmission control problems. Finally, as discussed in Section V-B5, how to speed up the DSPSA algorithm when it is applied to cross-layer modulation systems could be another direction of future research.


Appendix A 

Stochastic dominance is a stochastic ordering used in decision analysis. It describes when one probability distribution is superior to another in terms of the expected outcomes or costs. In this paper, we use the concept of first order stochastic dominance defined below to show the monotonicity of the optimal policy in channel states.

Definition A.1 (first order stochastic dominance [32]): Let $p(x)$ be a random selection on space $X$, where $x$ conditions the random selection. Then $p(x)$ is first order stochastically nondecreasing in $x$ if $E[u(p(x_+))] \ge E[u(p(x_-))]$ for all nondecreasing functions $u$ and $x_+ \ge x_-$.
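For a finite, ordered channel state space, Definition A.1 can be checked through tail sums: the rows of a transition matrix are first order stochastically nondecreasing in $h$ exactly when $\sum_{h' \ge k} P_{hh'}$ is nondecreasing in $h$ for every $k$. The sketch below applies this test; the function name and the two example matrices are illustrative and are not taken from the paper.

```python
import numpy as np

def is_stochastically_nondecreasing(P, tol=1e-12):
    """True if row h + 1 of P first order stochastically dominates row h for every h."""
    # tails[h, k] = sum over h' >= k of P[h, h']
    tails = np.cumsum(P[:, ::-1], axis=1)[:, ::-1]
    return bool(np.all(np.diff(tails, axis=0) >= -tol))

P_good = np.array([[0.7, 0.3, 0.0],
                   [0.2, 0.6, 0.2],
                   [0.0, 0.3, 0.7]])
P_bad = np.array([[0.2, 0.8],
                  [0.9, 0.1]])
print(is_stochastically_nondecreasing(P_good), is_stochastically_nondecreasing(P_bad))
```

A matrix such as `P_bad`, where a better current state makes a worse next state more likely, fails the test in the same way as the channel used for Fig. 5.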

Appendix B 

Assume $V(\mathbf{x}')$ is nondecreasing in $b'$. It is straightforward to see that $c_q(y, f)$ is nondecreasing in $y$. Since $\min\{[y]^+ + f, L_B\}$ is nondecreasing in $y$, $\beta V(\min\{[y]^+ + f, L_B\}, h')$ is nondecreasing in $y$. Therefore, $V(y, f, h)$ is nondecreasing in $y$. Consider the monotonicity of $Q$ in $b$. Since

\[ Q(b+1, h, a) - Q(b, h, a) = \sum_{h'} P_{hh'} E_f\big[V(b-a+1, f, h') - V(b-a, f, h')\big] \ge 0, \tag{28} \]

$Q$ is nondecreasing in $b$.

Appendix C

Since the sum of two $L^{\natural}$-convex functions is $L^{\natural}$-convex [26], $Q$ is $L^{\natural}$-convex if both $c_{tr}(h, a)$ and $\sum_{h'} P_{hh'} E_f[V(b-a, f, h')]$ are $L^{\natural}$-convex in $(b, a)$. Consider the $L^{\natural}$-convexity of $c_{tr}$. Since, for a given $h$, $c_{tr}$ is just a function of $a$, it suffices to show that $c_{tr}$ is $L^{\natural}$-convex in $a$. $c_{tr}$ is $L^{\natural}$-convex in $a$ since $2^a$ is convex in $a$.

Consider the $L^{\natural}$-convexity of $\sum_{h'} P_{hh'} E_f[V(b-a, f, h')]$. Since the expectation of an $L^{\natural}$-convex function is $L^{\natural}$-convex [26], it suffices to show the $L^{\natural}$-convexity of $V(y, f, h)$ in $(b, a)$. By Definition 3.2, we need to prove that $\psi(b, a, f, h', \zeta) = V(b-a, f, h')$ is submodular in $(b, a)$. But, by Definition 3.1, $\psi(b, a, f, h', \zeta)$ is submodular in $(b, a)$ since

\[ \psi(b+1, a, f, h', \zeta) + \psi(b, a+1, f, h', \zeta) - \psi(b, a, f, h', \zeta) - \psi(b+1, a+1, f, h', \zeta) = V(b-a+1, f, h') + V(b-a-1, f, h') - 2V(b-a, f, h') \ge 0 \tag{29} \]

for all $(b, a)$. See the proof in Appendix D for the last step in (29). Therefore, $V(y, f, h)$ is $L^{\natural}$-convex in $y$, and $Q$ is $L^{\natural}$-convex in $(b, a)$.
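The identity behind (29) can be checked numerically: for $\psi(b, a) = g(b - a)$, the submodularity gap in $(b, a)$ equals the second difference of $g$ at $y = b - a$, so it is nonnegative whenever $g$ is convex. In the sketch below `g` is an arbitrary convex toy function chosen for illustration, not the paper's $V$.

```python
def submodularity_gap(g, b, a):
    """Left-hand side of (29) for psi(b, a) = g(b - a)."""
    psi = lambda bb, aa: g(bb - aa)
    return psi(b + 1, a) + psi(b, a + 1) - psi(b, a) - psi(b + 1, a + 1)

def second_difference(g, y):
    """Discrete second difference g(y + 1) + g(y - 1) - 2 g(y)."""
    return g(y + 1) + g(y - 1) - 2 * g(y)

g = lambda y: 2.0 ** max(y, 0)  # convex on the integers (toy choice)
checks = [(submodularity_gap(g, b, a), second_difference(g, b - a))
          for b in range(6) for a in range(4)]
print(all(gap == sd and gap >= 0 for gap, sd in checks))
```

The check only illustrates the algebraic step; the convexity of $V(y, f, h)$ in $y$ itself is what Appendix D establishes.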


Appendix D 

Assume that $V(\mathbf{x}')$ is $L^{\natural}$-convex in $b'$. Then $V(y, f, h)$ is $L^{\natural}$-convex in $y$ because

\[ V(y+1, f, h) + V(y-1, f, h) - 2V(y, f, h) = c_q(y+1, f) + c_q(y-1, f) - 2c_q(y, f) + \beta\Big( V(\min\{[y+1]^+ + f, L_B\}, h) + V(\min\{[y-1]^+ + f, L_B\}, h) - 2V(\min\{[y]^+ + f, L_B\}, h) \Big) \]

/ 

0 

y <0 

w > 0 

y = 0 

< /3(v(y + l,h) + V(y -l,h)- 2V(y, /i)) 

>0 0 < y < Lb - f > 

wP/3(^V(LB-l,h)-V(LB,h)) 

II 

to 

1 

0 

Lb - f < y < Lb

where $\tilde{y} = y + f$. Let $a^*_{b+1} = \arg\min_a Q(b+1, h, a)$ and $a^*_b = \arg\min_a Q(b, h, a)$. We have

\begin{align}
V(b, h) - V(b+1, h) &= Q(b, h, a^*_b) - Q(b+1, h, a^*_{b+1}) \nonumber \\
&\ge Q(b, h, a^*_b) - Q(b+1, h, a^*_b) \nonumber \\
&= \sum_{h'} P_{hh'} E_f\big[V(b - a^*_b, f, h') - V(b + 1 - a^*_b, f, h')\big] \ge -w. \tag{30}
\end{align}

So, taking $b = L_B - 1$, $w + \beta\big(V(L_B - 1, h) - V(L_B, h)\big) \ge w - w\beta \ge 0$. Therefore, $V$ is $L^{\natural}$-convex in $y$.

Appendix E 

Assume that $V(\mathbf{x}')$ is submodular in $\mathbf{x}' = (b', h')$. $Q$ is submodular in $(b, h, a)$ because

\begin{align}
& Q(b, h+1, a) + Q(b, h, a+1) - Q(b, h, a) - Q(b, h+1, a+1) \nonumber \\
&= c_{tr}(h+1, a) + c_{tr}(h, a+1) - c_{tr}(h, a) - c_{tr}(h+1, a+1) \nonumber \\
&\quad + \sum_{(h+1)'} P_{(h+1)(h+1)'} E_f\big[V(b-a, f, (h+1)') - V(b-a-1, f, (h+1)')\big] \nonumber \\
&\quad + \sum_{h'} P_{hh'} E_f\big[V(b-a-1, f, h') - V(b-a, f, h')\big] \nonumber \\
&\ge w + \sum_{h'} P_{hh'} E_f\big[V(b-a-1, f, h') - V(b-a, f, h')\big] \tag{31} \\
&\ge w - w \ge 0, \tag{32}
\end{align}

\begin{align}
& Q(b+1, h, a) + Q(b, h+1, a) - Q(b, h, a) - Q(b+1, h+1, a) \nonumber \\
&= \beta \sum_{(h+1)'} P_{(h+1)(h+1)'} E_f\big[V(\min\{[b-a]^+ + f, L_B\}, (h+1)') - V(\min\{[b-a+1]^+ + f, L_B\}, (h+1)')\big] \nonumber \\
&\quad + \beta \sum_{h'} P_{hh'} E_f\big[V(\min\{[b-a+1]^+ + f, L_B\}, h') - V(\min\{[b-a]^+ + f, L_B\}, h')\big] \ge 0 \tag{33}
\end{align}

and

\[ Q(b+1, h, a) + Q(b, h, a+1) - Q(b, h, a) - Q(b+1, h, a+1) \ge 0. \tag{34} \]

Here, (31) holds because $V$ is nondecreasing in $y$, as proved in Appendix B; (32) holds because $\sum_{h'} P_{hh'} E_f[V(b-a-1, f, h') - V(b-a, f, h')] \ge -w$, as proved in (30); (33) follows from the submodularity of $V(\mathbf{x}')$ in $\mathbf{x}' = (b', h')$ and the first order stochastic monotonicity of $P_{hh'}$ in $h$; and (34) is due to the $L^{\natural}$-convexity of $Q$ in $(b, a)$, as shown in Appendix C.